Search Result

Journals

Publication Years

Keywords

Please wait a minute...

For Selected:

Download Citations
EndNote Ris BibTeX

Toggle Thumbnails

Select

Deep natural language description method for video based on multi-feature fusion

LIANG Rui, ZHU Qingxin, LIAO Shujiao, NIU Xinzheng

Journal of Computer Applications 2017, 37 (4): 1179-1184. DOI: 10.11772/j.issn.1001-9081.2017.04.1179

Abstract （520）

PDF （999KB）（566）

Save

Concerning the low accuracy of automatically labelling or describing videos by computers, a deep natural language description method for video based on multi-feature fusion was proposed. The spatial features, motion features and video features of video frame sequence were extracted and fused to train a Long-Short Term Memory (LSTM) based natural language description model. Several natural language description models were trained through the combination of different features from early fusion, then did a late fusion when testing. One of the models was selected to predict possible outputs under current inputs, and the probabilities of these outputs were recomputed with other models, then a weighted sum of these outputs was computed and the output with the highest probability was used as the next output. The feature fusion methods of the proposed method include early fusion such as feature concatenating, weighted summing of different features after alignment, and late fusion such as weighted fusion of outputs' probabilities of different models based on different features, finetuning generated LSTM model by early fused features. Comparison experimental results on Microsoft Video Description (MSVD) dataset indicate that the fusion of different kinds of features can promote the evaluation score, while the fusion of the same kind of features cannot get higher evaluation score than that of the best feature; however, finetuning pre-trained model with other features has poor effect. Among different combination of different features tested, the description generated by the method of combining early fusion and later fusion gets 0.302 of METEOR, which is 1.34% higher than the highest score that can be found, it means that the method is able to improve the accuracy of video automatic description.

Reference | Related Articles | Metrics